Confident Learning: Estimating Uncertainty in Dataset Labels
Authors
Curtis G. Northcutt, Lu Jiang, Isaac L. Chuang
Abstract
Learning exists in the context of data, yet notions of confidence typically focus on model predictions, not label quality. Confident learning (CL) is an alternative approach which focuses instead on label quality by characterizing and identifying label errors in datasets, based on the principles of pruning noisy data, counting with probabilistic thresholds to estimate noise, and ranking examples to train with confidence. Whereas numerous studies have developed these principles independently, here, we combine them, building on the assumption of a class-conditional noise process to directly estimate the joint distribution between noisy (given) labels and uncorrupted (unknown) labels. This results in a generalized CL which is provably consistent and experimentally performant. We present sufficient conditions where CL exactly finds label errors, and show CL performance exceeding seven recent competitive approaches for learning with noisy labels on the CIFAR dataset. Uniquely, the CL framework is not coupled to a specific data modality or model (e.g., we use CL to find several label errors in the presumed error-free MNIST dataset and improve sentiment classification on text data in Amazon Reviews). We also employ CL on ImageNet to quantify ontological class overlap (e.g., estimating 645 missile images are mislabeled as their parent class projectile), and moderately increase model accuracy (e.g., for ResNet) by cleaning data prior to training. These results are replicable using the open-source cleanlab release.
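To make the "counting with probabilistic thresholds" step concrete, below is a minimal NumPy sketch of the idea: a per-class threshold (the mean self-confidence of examples carrying that given label) decides which examples are counted confidently into each (given label, estimated true label) cell. Function names like `estimate_confident_joint` and the calibration step are illustrative simplifications, not the paper's exact implementation.

```python
import numpy as np

def estimate_confident_joint(labels, pred_probs):
    """Count examples with given label i whose predicted probabilities
    confidently indicate true class j, yielding C[i, j].
    labels: int array of given labels, shape (n,); pred_probs: (n, m).
    Assumes every class appears at least once among the given labels."""
    n, m = pred_probs.shape
    # Per-class threshold t_j: mean predicted probability of class j
    # over examples whose given label is j (mean self-confidence).
    thresholds = np.array([pred_probs[labels == j, j].mean() for j in range(m)])
    C = np.zeros((m, m), dtype=np.int64)
    for x in range(n):
        confident = np.flatnonzero(pred_probs[x] >= thresholds)
        if confident.size == 0:
            continue  # confident in no class: this example is not counted
        j = confident[np.argmax(pred_probs[x, confident])]
        C[labels[x], j] += 1
    return C

def estimate_joint(labels, pred_probs):
    """Calibrate C so each row sums to that label's observed count,
    then normalize to a joint distribution over (given, true) labels."""
    C = estimate_confident_joint(labels, pred_probs).astype(float)
    label_counts = np.bincount(labels, minlength=pred_probs.shape[1])
    row_sums = C.sum(axis=1, keepdims=True)
    calibrated = C / np.where(row_sums == 0, 1.0, row_sums) * label_counts[:, None]
    return calibrated / calibrated.sum()
```

Off-diagonal mass in the estimated joint flags likely label errors, which can then be ranked and pruned before retraining. In practice one would use the released cleanlab package rather than this sketch (recent versions expose this workflow via functions such as `cleanlab.filter.find_label_issues`, though the exact API depends on the version).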
Similar References
Robust supervised learning under uncertainty in dataset shift
When machine learning is deployed in the real world, its performance can be significantly undermined because test data may follow a different distribution from training data. To build a reliable machine learning system in such a scenario, we propose a supervised learning framework that is explicitly robust to the uncertainty of dataset shift. Our robust learning framework is flexible in modelin...
Learning with Confident Examples: Rank Pruning for Robust Classification with Noisy Labels
P̃Ñ learning is the problem of binary classification when training examples may be mislabeled (flipped) uniformly with noise rate ρ1 for positive examples and ρ0 for negative examples. We propose Rank Pruning (RP) to solve P̃Ñ learning and the open problem of estimating the noise rates. Unlike prior solutions, RP is efficient and general, requiring O(T) for any unrestricted choice of probabili...
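Purely for illustration, here is a hedged sketch of the pruning step the method's name refers to, under the assumption that the fractions of mislabeled examples in the noisy positive and negative sets have already been estimated (the names `pi1_hat` and `pi0_hat` below are hypothetical, not the paper's notation): rank examples by predicted probability and drop the fraction most likely to be flipped from each set.

```python
import numpy as np

def rank_prune(s, g, pi1_hat, pi0_hat):
    """Sketch: keep the confidently labeled subset of a noisily
    labeled binary dataset.
    s       -- given (noisy) binary labels, shape (n,)
    g       -- predicted probability of the positive class, shape (n,)
    pi1_hat -- assumed/estimated fraction of given-positives that are mislabeled
    pi0_hat -- assumed/estimated fraction of given-negatives that are mislabeled
    Returns indices of examples to keep for retraining."""
    pos = np.flatnonzero(s == 1)
    neg = np.flatnonzero(s == 0)
    # Drop the given-positives the model is least confident are positive...
    k_pos = int(round(pi1_hat * pos.size))
    keep_pos = pos[np.argsort(g[pos])[k_pos:]]
    # ...and the given-negatives the model is most confident are positive.
    k_neg = int(round(pi0_hat * neg.size))
    keep_neg = neg[np.argsort(g[neg])[: neg.size - k_neg]]
    return np.concatenate([keep_pos, keep_neg])
```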
Ballpark Learning: Estimating Labels from Rough Group Comparisons
We are interested in estimating individual labels given only coarse, aggregated signal over the data points. In our setting, we receive sets (“bags”) of unlabeled instances with constraints on label proportions. We relax the unrealistic assumption of known label proportions, made in previous work; instead, we assume only to have upper and lower bounds, and constraints on bag differences. We mot...
Confident Multiple Choice Learning
Ensemble methods are arguably the most trustworthy techniques for boosting the performance of machine learning models. Popular independent ensembles (IE) relying on a naïve averaging/voting scheme have been the typical choice for most applications involving deep neural networks, but they do not consider advanced collaboration among ensemble models. In this paper, we propose new ensemble methods sp...
Learning with Multiple Labels
In this paper, we study a special kind of learning problem in which each training instance is given a set of (or distribution over) candidate class labels and only one of the candidate labels is the correct one. Such a problem can occur, e.g., in an information retrieval setting where a set of words is associated with an image, or if class labels are organized hierarchically. We propose a nov...
Journal
Journal title: Journal of Artificial Intelligence Research
Year: 2021
ISSN: 1076-9757, 1943-5037
DOI: https://doi.org/10.1613/jair.1.12125